Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes

Characterizing the effect of mutations is key to understand the evolution of protein sequences and to separate neutral amino-acid changes from deleterious ones. Epistatic interactions between residues can lead to a context dependence of mutation effects. Context dependence constrains the amino-acid changes that can contribute to polymorphism in the short term, and the ones that can accumulate between species in the long term. We use computational approaches to accurately predict the polymorphisms segregating in a panel of 61,157 Escherichia coli genomes from the analysis of distant homologues. By comparing a context-aware Direct-Coupling Analysis modelling to a non-epistatic approach, we show that the genetic context strongly constrains the tolerable amino acids in 30% to 50% of amino-acid sites. The study of more distant species suggests the gradual build-up of genetic context over long evolutionary timescales by the accumulation of small epistatic contributions.

Controlling for phylogenetic bias in DCA DCA assumes that amino-acid sequences observed in nature represent an (up to simple reweighting) independent sample from an unknown probability distribution. Phylogenetic relationships between members of the same protein family necessarily violate this hypothesis. To date there is no known method to incorporate phylogenetic relations into coevolutionary modeling of amino-acid sequences, but to assess the influence of phylogeny one can follow the methodology developed by Horta and Weigt [1]. Instead of trying to disentangle functional couplings from phylogenetic ones, they have proposed null models reproducing conservation and phylogenetic patterns observed in the original MSA, but removing any signature of coevolution between residues. Comparing results obtained from DCA models trained on true MSAs to those obtained with DCA models trained on corresponding randomized MSAs, we can assess whether "epistatic" signals detected by DCA are actually caused by phylogenetic correlations.
Here, we use null model II [1]: it randomizes MSA to have both similar position-specific frequencies of amino acids and similar pairwise Hamming distances between sequences (Supplementary Methods section MSA randomization procedure to control for phylogenetic couplings). The former ensures the conservation of the MSA profile, meaning that an IND model will give equivalent results if trained on the randomized MSA compared to the original one. The conservation of the pairwise distances between MSA sequences allows to reproduce phylogenetic patterns contained in the original MSA. On the contrary, no coevolutionary signal is contained in the randomized MSA, in difference to the original one. Due to the high computational cost of this randomization procedure, we have chosen to run it on a subset of 51 Pfam MSAs randomly selected among the 2,053 screened in our work. We observe that a DCA model trained on a randomized MSA generates a CDE which takes intermediate values between those of the original CDE and CIE (Supplementary Figure 10a). This reduction in entropy compared to the entropy of an independent model is an expected outcome of the addition of phylogenetic couplings into the DCA modeling procedure. However, we note that phylogenetic couplings are not sufficient to explain the very low CDE observed in the original data. This indicates that structural and functional coevolution play an essential role in constraining CDE values.
In our work, we have used CDE to predict polymorphic and conserved sites in E. coli. We have taken care of excluding any sequence with more than 90% identity with E. coli reference strain. Given that the average variability within E. coli species only reaches 2%, DCA predictions of polymorphisms give a way to assess its out-of-sample performance. The patterns of conservation and variability within E. coli are the results of selection, not phylogeny. This is illustrated by the lower performance of DCA models trained on randomized MSAs in predicting polymorphisms compared to the predictions made by IND models and by DCA models trained on original MSAs (Supplementary Figure 10b). In contrast, DCA models trained on original MSAs outperform IND models, meaning that their predictions rely more on structural or functional coevolutionary couplings than on phylogenetic correlations.
Taken altogether, these results suggest that -even if DCA may incorporate phylogenetic couplings -these do not account for the majority of the signals we detect. These findings are in line with conclusions in [1] and DCA's ability to predict functional protein variants [2].

Supplementary Methods
MSA randomization procedure to control for phylogenetic couplings Among the 2,053 inter-species Pfam MSAs used to train models to study inter-strain data, 51 MSAs were randomly selected. Using published algorithms from Horta and Weigt [1], 51 randomized MSAs were produced with Null model II. The corresponding DCA models were trained on these MSAs and the CDE was computed with the randomized reference sequence taken for context. Independent models and DCA models trained with the corresponding original MSAs were used to compare their performance to those of the DCA models trained on randomized MSAs.   The best λ parameter is 3.6. Square loss is presented as average value across 20 simulations ±1 standard deviation. (b) Simulation of synonymous diversity. Average results of the 20 simulations of synonymous diversity with λ = 3.6. We have focussed on sites where there are exactly four possible 1-SNP synonymous mutations. As we can see synonymous diversity is not saturated (sites with all four possible synonymous codons observed in the dataset are rare). Simulations (in orange) achieve good fit of the observed reality (in blue) even with a basic model like JC69 that ignores differences in mutation rates between nucleotide pairs. (c) Simulation of non-synonymous diversity. Bivariate histogram of CDE and CIE for sites that are conserved in the simulated dataset produced with parameter λ = 3.6. Most of the sites cluster on the left peak of low CDE. However, as observed in the real dataset, some of the sites where no mutation occurred have a high CDE. (d) Comparison of CDE distributions of real conserved sites (sites conserved across >60,000 strains in the dataset, in blue) and simulated conserved sites (sites where no mutation was simulated, in orange).